Distributed Web-Scale Infrastructure for Crawling, Indexing and Search with Semantic Support

Authors

Stefan Dlugolinsky

Martin Seleng

Michal Laclavik

Ladislav Hluchý

Abstract

In this paper, we describe our work in progress on web-scale information extraction and information retrieval using distributed computing. We present a distributed architecture built on top of the MapReduce paradigm for information retrieval, information processing and intelligent search supported by spatial capabilities. The proposed architecture focuses on crawling documents in several different formats, information extraction, lightweight semantic annotation of the extracted information, indexing of the extracted information and, finally, indexing of documents based on the geo-spatial information found in a document. We demonstrate the architecture on two use cases, search in job offers retrieved from the LinkedIn portal and search in BBC news feeds, and discuss several problems we had to face during the implementation. We also discuss spatial search applications for both cases, because both LinkedIn job offer pages and BBC news feeds contain a lot of spatial information to extract and process.

Similar resources

Integrating RDF Querying Capabilities into a Distributed Search Infrastructure

The Semantic Web is inherently distributed, and covers both metadata and full-text information. Semantic search therefore can profit a lot from peer-to-peer infrastructures as well as from powerful metadata search functionalities based on full-text search technologies. In this paper we focus on an approach extending an existing P2P search infrastructure with RDF querying capabilities, which bot...

Building the Infrastructure of Resource Sharing: Union Catalogs, Distributed Search, and Cross-Database Linkage

Effective resource sharing presupposes an infrastructure which permits users to locate materials of interest in both print and electronic formats. Two approaches for providing this are union catalogs and Z39.50-based distributed search systems. The advantages and limitations of each approach are considered, paying particular attention to a realistic assessment of Z39.50 implementations. This art...

How to Build Google2Google - An (Incomplete) Recipe

This talk explores aspects relevant for peer-to-peer search infrastructures, which we think are better suited to semantic web search than centralized approaches. It does so in the form of an (incomplete) cookbook recipe, listing necessary ingredients for putting together a distributed search infrastructure. The reader has to be aware, though, that many of these ingredients are research question...

Web-crawling reliability

In this article, I investigate the reliability, in the social science sense, of collecting informetric data about the World Wide Web by Web crawling. The investigation includes a critical examination of the practice of Web crawling and contrasts the results of content crawling with the results of link crawling. It is shown that Web crawling by search engines is intentionally biased and selectiv...

Efficient Proposed Framework for Semantic Search Engine using New Semantic Ranking Algorithm

The amount of information rises by billions of databases every year, and there is an urgent need to search that information with a specialized tool called a search engine. There are many search engines available today, but the main challenge with them is that most cannot retrieve meaningful information intelligently. The semantic web technology is a solution that keeps data...

Journal:

Computer Science (AGH)

Volume 13

Publication date: 2012
